Detecting relationships in unstructured text

ABSTRACT

Disclosed are embodiments of a system and a method for detecting relationships described in unstructured text-based electronic documents. The system and method incorporate the use of an input file that contains one or more text patterns that represent particular relationships. The text patterns each include regular text expressions that describe the particular relationship and slots for the location of each entity in that relationship. Document(s) are selected by a user and scanned by a proper noun tagger that identifies and tags every occurrence of a proper name within the document(s). Then, a pattern matcher scans the document(s) to match text patterns. If a text pattern is matched within a document, a relationship detector extracts all pairs of proper names found in the slots for each matched text pattern. The output from the relationship detector includes the names for each entity in the relationship, the type of relationship, the identity of the document, and the location of the sentence describing the relationship in the document.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The invention generally relates to the field of data mining and, more particularly, to a system and a computer-implemented method of detecting relationships by creating input files of text patterns for each type of relationship, identifying a specific text pattern within a text-based document, tagging proper names in the text-based document, and extracting those proper names located within the specific text pattern so as to identify the two entities in the relationship.

2. Description of the Related Art

Recently, there has been a rapid growth of on-line discussion groups and news websites on the World Wide Web (WWW). Detecting relationships between entities (e.g., buyer/seller, employee/employer, partnerships, parent/subsidiaries, etc.) discussed on those websites could prove to be a valuable resource (e.g., to a company investigating a rival company's business dealings, to a company or individual investigating a prospective client, employee, or contractor, etc.). However, the task of manually detecting such relationships from amongst the large corpus of documents contained on the Web is laborious. Thus, there is a need for a system and computer-implemented method for automatically and accurately detecting relationships in unstructured text contained within electronic documents with minimal processing times so as to be scalable to large document sets. The challenge is both in identifying entities in a document and in detecting the particular relationship, if any, between two entities.

SUMMARY OF THE INVENTION

In view of the foregoing, embodiments of the invention provide a system and a computer-implemented method of detecting relationships in unstructured text.

An embodiment of a method of detecting relationships in unstructured text comprises first creating text patterns that represent different types of relationships and storing those text patterns in an input file. For example, the input file can store various text patterns representing employer/employee relationships, various patterns representing partnership relationships, etc. The text patterns may be custom-created by a user and input into the input file and/or pre-created and stored in the input file by a system manufacturer. A text pattern may be created by developing at least one regular text expression, comprising a plurality of words that describe the particular type of relationship. Additionally, the text pattern is developed with two or more slots positioned within, before, or after this regular text expression. These slots will be used in subsequent method steps, as described below, in order to identify the proper names of the entities involved in the relationship (e.g., a first slot for the name of the first entity and a second slot for the name of the second entity in the relationship). The text pattern can also be created with slot location identifiers which indicate a position of the first slot and/or a position of the second slot relative to the regular text expression. For example, the text pattern can be created with slot location identifiers that indicate that the first slot should be located before the text expression and/or within a predetermined proximity from the text expression (e.g., within a predetermined number of words from the text expression). Similarly, the text pattern can be created with slot location identifiers to indicate that the second slot should be located after the text expression and/or within a predetermined proximity from the text expression.
Additionally, the text pattern can be created with a relationship order identifier (i.e., an identifier that defines an order of the first and second entities in the relationship based on the locations of the proper names within the first and second slots). For example, if the type of relationship detected is a customer/seller relationship in which one entity is a “customer of” another entity, a relationship order identifier can be embedded in the text pattern to indicate that the proper name located in the first slot identifies the customer. Lastly, the text pattern can be created with a keyword for the particular type of relationship, and specifically, for the particular text pattern. This keyword may be used in subsequent method steps, as described below, to screen out documents prior to conducting a pattern matching analysis.

In addition to creating an input file, one or more text-based electronic documents (e.g., an unstructured text document (UTD)) are selected for processing by using an input device. The documents can be selected, for example, from the world wide web (WWW), from a wide area network (WAN), from a local area network, etc. The selection of documents can include a specific document, all documents in a specified category of documents, all documents having a specified date range, all documents matching a Boolean query of terms, etc. The selected unstructured text document(s) may be preprocessed, for example, by a preprocessor, in order to provide “noise free” text to either the proper noun tagger or the keyword identifier, described below.

Processing of a selected text-based document comprises analyzing the document in order to determine the location for each proper name occurring within the document. This can be accomplished using a multi-step process performed, for example, by a proper noun tagger. The tagger can be adapted to first scan the document in order to identify each of the proper names occurring within the document based on a predetermined set of matching rules. The set of matching rules can be based, for example, on word capitalization, sentence structure, sentence boundaries, excluded words, etc. The tagger can also be adapted to re-scan the document in order to tag and record each of the proper names found within the document along with their locations.

Processing of a selected text-based document also comprises analyzing the document on a sentence by sentence basis so as to locate a text pattern within the document. This can also be accomplished using a multi-step process performed, for example, by a pattern keyword identifier and pattern matcher. The keyword identifier can be adapted to first scan the document in order to determine whether or not a keyword from one or more of the text patterns in the input file is located in the document. If a keyword for a particular text pattern is found, then a full text pattern matching process can be performed, for example, by a pattern matcher, to determine if the regular text expression defined in the particular text pattern is located in the document. If a full text pattern is found within the document, the identity of the document is recorded and the location of the full text pattern is flagged.

Upon detection of a full text pattern within a document, a multi-step relationship detection process is performed, for example, by a relationship detector. The relationship detector refers to the list of proper names recorded by the proper noun tagger, determines if proper names are located within the first and second slots, and extracts those proper names, thereby identifying the first and second entities engaged in the relationship. Additionally, if an order for the relationship between the first and second entities is defined in the text pattern, then the relationship detector determines the order. Lastly, the relationship detector outputs the results of the relationship detection analysis. Specifically, the relationship detector can provide an output comprising the type of relationship, the names of the first and second entities engaged in the relationship, the order of the relationship (if applicable), the identification of the document, and the location in the document where the relationship was detected (i.e., the location of the text pattern), which can be stored and/or displayed.

An embodiment of a system for detecting relationships in one or more unstructured text documents comprises text pattern input files, a keyword identifier, a pattern matcher, a proper noun tagger and a relationship detector.

More specifically, the system can comprise text pattern input files stored in memory. These input files comprise text patterns that describe different types of relationships. The text patterns can be pre-created and input in the input file (e.g., by a system manufacturer) or custom developed and input into the input file by the user using an input device (e.g., a keyboard, disk, CD, internet link, hard drive, etc.). Each text pattern can comprise at least one regular text expression having a plurality of words that describe a particular relationship as well as two or more slots positioned within, before, or after this regular text expression. The slots will be used by system features, as described below, in order to identify the proper names of the entities involved in the relationship (e.g., a first slot for the name of the first entity and a second slot for the name of the second entity in the relationship). The text pattern can also comprise slot location identifiers that indicate a position of the first slot and/or a position of the second slot relative to the regular text expression, as described in detail above. Additionally, the text pattern can comprise a relationship order identifier that defines an order of the first and second entities in the relationship based on the locations of the proper names within the first and second slots, also as described in detail above. Lastly, the text pattern can comprise a keyword for the particular type of relationship and, specifically, for the particular text pattern. This keyword may be used by other features of the system, as described below, to screen out documents prior to conducting a pattern matching analysis.

A communications link can be established between the system and a source for unstructured text documents (e.g., the world wide web (WWW), a wide area network (WAN), a local area network, etc.) so that a user of the system, using an input device (e.g., a keyboard, mouse, etc.), can select one or more text-based electronic documents for analysis. The documents may be selected such that they include a specific document, all documents in a specified category of documents, all documents having a specified date range, all documents matching a Boolean query of terms, etc. The system may further comprise a pre-processor adapted to pre-process selected unstructured text document(s) prior to analysis in order to provide “noise free” text to either the proper noun tagger or the keyword identifier, described below.

The proper noun tagger can be adapted to receive the selected unstructured text document(s) and to perform a multi-step tagging process on the documents. Specifically, the tagger can be adapted to first scan each document in order to identify each occurrence of a proper name within the document based on a predetermined set of matching rules. The set of matching rules can be based, for example, on word capitalization, sentence structure, sentence boundaries, excluded words, etc. The tagger can also be adapted to re-scan the document in order to tag and record a list of each of the proper names found within the document along with their locations.

The keyword identifier is in communication with the relationship pattern input file and is adapted to receive the selected unstructured text document(s) (e.g., before, after, or separate from the processing by the proper noun tagger) and to analyze the document(s). Specifically, the keyword identifier is adapted to scan each document sentence by sentence in order to determine whether or not a keyword from one or more of the text patterns in the input file is located in the document. If a keyword for a particular text pattern is found, the document is forwarded to a pattern matcher for further analysis.

The pattern matcher is adapted to perform a full text pattern matching process on the forwarded document. Specifically, the pattern matcher is adapted to scan the document sentence by sentence to determine if the regular text expression defined in the particular text pattern associated with the keyword is located in the document. If a full text pattern is found within the document, the identity of the document is recorded, the location of the full text pattern is flagged, and the document is forwarded to the relationship detector.

The relationship detector is in communication with the proper noun tagger and is adapted to analyze the document further in order to detect a relationship. Specifically, the relationship detector is adapted to refer to the list of proper names recorded by the proper noun tagger and to determine if proper names are located within the first and second slots for the text pattern that was located in the document. If proper names are found in both slots, the relationship detector extracts those proper names and thereby identifies the first and second entities engaged in the relationship described by the text pattern. Additionally, if an order for the relationship between the first and second entities is defined in the text pattern, then the relationship detector determines the order of each named entity. Lastly, the relationship detector outputs the results of the relationship detection analysis. Specifically, the relationship detector can provide an output comprising the type of relationship, the names of the first and second entities engaged in the relationship, the order of the relationship (if applicable), the identification of the document, and the location in the document where the relationship was detected (i.e., the location of the text pattern). This output can be stored (e.g., in a data storage device) and/or displayed on a display screen.

These and other aspects of embodiments of the invention will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following description, while indicating embodiments of the invention and numerous specific details thereof, is given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the embodiments of the invention without departing from the spirit thereof, and the invention includes all such modifications.

BRIEF DESCRIPTION OF THE DRAWINGS

The embodiments of the invention will be better understood from the following detailed description with reference to the drawings, in which:

FIG. 1 is a schematic flow diagram of an embodiment of a method of detecting relationships in unstructured text-based electronic documents;

FIG. 2 is a schematic block diagram of an exemplary relationship text pattern input file;

FIG. 3 is a schematic block diagram representing an embodiment of a system of detecting relationships in unstructured text-based electronic documents; and,

FIG. 4 is a schematic representation of a computer system suitable for use in detecting relationships in unstructured text-based electronic documents.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS OF THE INVENTION

The embodiments of the invention and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. It should be noted that the features illustrated in the drawings are not necessarily drawn to scale. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments of the invention. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments of the invention may be practiced and to further enable those of skill in the art to practice the embodiments of the invention. Accordingly, the examples should not be construed as limiting the scope of the invention.

As mentioned above, there is a need for a system and a computer-implemented method for automatically and accurately detecting relationships (e.g., a partner relationship between two corporations, an employee-employer relationship between two people, a seller-customer relationship, etc.) in unstructured text contained within electronic documents with minimal processing times so as to be scalable to large document sets. The challenge is both in identifying entities in a document and in detecting the particular relationship, if any, between two entities. Therefore, disclosed herein are embodiments of a system and method for detecting any type of relationship that is described in unstructured text-based electronic documents. Specifically, the system and method each incorporate the use of an input file that contains one or more text patterns that represent particular relationships. The text patterns each include regular text expressions that describe the particular relationship and slots for the location of each entity in that relationship. Document(s) are selected by a user and scanned by a proper noun tagger that identifies and tags every occurrence of a proper name within the document(s). Then, a pattern matcher scans the document(s) to match text patterns from the input file. If a text pattern is matched, a relationship detector extracts the proper names found in the slots for each matched text pattern. The output from the relationship detector includes the names for each entity in a relationship, the type of relationship, the identity of the document, and the location of the sentence describing the relationship in the document.

More particularly, referring to FIG. 1, an embodiment of a method of detecting relationships in unstructured text comprises first creating text patterns 205 that represent different types of relationships 201 and storing those text patterns 205 in an input file 200, as illustrated in FIG. 2 (102-104). For example, the input file 200 can store various text patterns representing different types of relationships 201, such as employer/employee relationships, partnership relationships, etc. The text patterns 205 may be custom created and input into the input file 200 by a user and/or pre-created and stored in the input file 200 by a system manufacturer (e.g., as illustrated in FIG. 3 and discussed below). Any number of input files may be given as input, with each file containing a list of patterns 205 for a particular relationship 201.

Specifically, the text patterns 205 may be created by developing at least one regular text expression 210, comprising a plurality of words that describe the particular type of relationship, and providing two or more slots 208, 212 positioned within, before, or after this regular text expression. These slots will be used in subsequent method steps, as described below, in order to identify the proper names of the entities involved in the relationship (e.g., a first slot for the name of the first entity and a second slot for the name of the second entity in the relationship). The text pattern 205 can also be created with slot location identifiers 202 that indicate a position of the first slot and/or a position of the second slot relative to the regular text expression. For example, the text pattern 205 can be created with a slot location identifier 202a to indicate that the first slot 208 should be located before the text expression 210 and/or within a predetermined proximity from the text expression (e.g., within a predetermined number of words from the text expression). Similarly, the text pattern 205 can be created with a slot location identifier 202b to indicate that the second slot 212 should be located after the text expression 210 and/or within a predetermined proximity from the text expression. Additionally, the text pattern 205 can be created with a relationship order identifier 204 (i.e., an identifier that defines an order of the first and second entities in a relationship that is not symmetric, based on the locations of the proper names within the first and second slots 208 and 212). For example, if the type of relationship detected is a customer/seller relationship in which one entity is a “customer of” another entity, a relationship order identifier can be embedded in the text pattern to indicate that the proper name located in the first slot identifies the customer.
Lastly, the text pattern 205 can also be created with a keyword 206 for the particular type of relationship 201, and specifically, for the particular text pattern. This keyword 206 may be used in subsequent method steps, as described below, to screen out documents prior to conducting a pattern matching analysis and to thereby improve the speed of the pattern matching.

For example, the text patterns 205 may be described in any language that supports regular expression matching (e.g., Perl) such that the slots 208 and 212 for the entities match the $1 and $2 variables after a successful match is performed. The following illustrates an exemplary text pattern for a customer relationship between two corporations:

0,1,1,awarded,(.*)(?:has )?awarded (.*) a (?:[^ ]* ){0,3}contract

This exemplary text pattern contains five comma-separated fields: (1) a first number (i.e., slot location identifier 202), (2) a second number (i.e., another slot location identifier 202), (3) a third number (i.e., a relationship order identifier 204), (4) a keyword 206 for the pattern, and (5) a regular expression 210 with two slots 208 and 212. Pursuant to Perl syntax, the text matching the first (.*) in the expression will be accessible via the $1 variable after a successful match has been performed. Similarly, the text matching the second occurrence will be accessible via the $2 variable. These two variables describe the location of the slot for each entity in the pattern. When a match is performed, these slots may contain an arbitrary amount of text, and the proper names are located within the slots. The first two numbers in this exemplary text pattern comprise the slot location identifiers 202 and refer to the text matched in the $1 and $2 slots, respectively. A 0 means that, for a successful match, a proper name found within the slot must occur to the far right; a 1 means it must occur to the far left. The third number in the exemplary text pattern comprises the relationship order identifier, which specifies the order of the entities in the relationship. For example, if the relationship is “customer of,” a 1 in this field means that entity 1 (matched via $1) is a customer of entity 2 (matched via $2). A 2 in this field would mean that entity 2 is a customer of entity 1. If this field is 0, the relationship is symmetric, as in a partnership relation.
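By way of a non-limiting illustration, the field layout described above can be sketched in Python, whose re module exposes the first and second captured groups in the same manner as the $1 and $2 variables of Perl. The names TextPattern and parse_pattern_line, the test sentence, and the company names are hypothetical and are not part of the disclosure; the pattern line is a cleaned-up reading of the exemplary text pattern and should be treated as an assumption.

```python
import re
from dataclasses import dataclass


@dataclass
class TextPattern:
    """Hypothetical container for the five comma-separated fields."""
    slot1_location: int  # 0: name at far right of slot 1, 1: far left
    slot2_location: int  # same convention, applied to slot 2
    order: int           # 0: symmetric, 1: entity 1 -> entity 2, 2: reversed
    keyword: str
    regex: re.Pattern


def parse_pattern_line(line: str) -> TextPattern:
    # Split on the first four commas only: the regular expression itself
    # may contain commas (for example in the {0,3} quantifier).
    loc1, loc2, order, keyword, expression = line.split(",", 4)
    return TextPattern(int(loc1), int(loc2), int(order), keyword,
                       re.compile(expression))


# Assumed reading of the exemplary pattern line.
LINE = r"0,1,1,awarded,(.*)(?:has )?awarded (.*) a (?:[^ ]* ){0,3}contract"
pattern = parse_pattern_line(LINE)

# Hypothetical sentence, invented for illustration.
match = pattern.regex.search(
    "Acme Corp has awarded Widget Inc a five year contract")
slot1, slot2 = match.group(1), match.group(2)  # analogous to $1 and $2
```

In this sketch the first slot is greedy and therefore captures more than the name alone (here it also takes in the word "has"), which suggests why a slot location identifier is useful for deciding which proper name within a slot is the intended entity.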

At the start of the process, all input files and corresponding text patterns are loaded into memory and a mapping is created from relationship type 201 to the set of patterns 205 for that relationship.
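A minimal sketch of such a loading step is given below, assuming (hypothetically) that each input file is associated with exactly one relationship type and holds one pattern per line. The function name load_patterns, the relationship type labels, and the partnership pattern line are invented for illustration; in-memory text streams stand in for files on disk.

```python
import io
from collections import defaultdict
from typing import Dict, IO, List


def load_patterns(files: Dict[str, IO[str]]) -> Dict[str, List[str]]:
    """Map each relationship type to the list of its pattern lines."""
    patterns_by_type = defaultdict(list)
    for relationship_type, handle in files.items():
        for raw_line in handle:
            line = raw_line.strip()
            if line:  # ignore blank lines
                patterns_by_type[relationship_type].append(line)
    return dict(patterns_by_type)


# Hypothetical input files, one per relationship type.
input_files = {
    "customer_of": io.StringIO(
        "0,1,1,awarded,(.*)(?:has )?awarded (.*) a (?:[^ ]* ){0,3}contract\n"
    ),
    "partner_of": io.StringIO(
        "0,1,0,partnered,(.*) partnered with (.*)\n"
        "\n"
    ),
}
mapping = load_patterns(input_files)
```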

Referring again to FIG. 1, in addition to creating an input file, one or more text-based electronic documents (e.g., an unstructured text document (UTD)) are selected for processing by using an input device (e.g., the same or a different input device than that used to create and input the input files) (106). The documents can be selected, for example, from the world wide web (WWW), from a wide area network (WAN), from a local area network, etc. The selection of documents can include a specific document, all documents in a specified category of documents, all documents having a specified date range, all documents matching a Boolean query of terms, etc. Each selected unstructured text document may be preprocessed, for example, by a preprocessor, in order to provide “noise free” text to either the proper noun tagger or the keyword identifier, described below (108).

Processing of each selected text-based document comprises analyzing the document in order to determine the location for each proper name occurring within the document (116). This can be accomplished using a multi-step process performed, for example, by a proper noun tagger. The tagger can be adapted to first scan the document in order to identify each of the proper names occurring within each document based on a predetermined complex set of matching rules and lexicons. The set of matching rules defines the proper nouns based, for example, on word capitalization, sentence structure, sentence boundaries, excluded words, etc. For example, the list of excluded words may include months, days of the week, words not capitalized in a title, etc. An exemplary rule may provide that all capitalized words, not located at the beginning of a sentence and not included on the list of excluded words, are identifiable as proper nouns. The tagger can also be adapted to re-scan the document in order to tag and record a list of each of the proper names found within the document along with their locations.
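The exemplary capitalization rule can be sketched as follows. This is an illustration under stated assumptions rather than the tagger of the embodiments: the function name tag_proper_names, the abbreviated excluded-word list, the tokenizing expression, and the merging of adjacent capitalized words into a single name are all hypothetical choices, and the test sentence is invented.

```python
import re
from typing import List, Tuple

# Abbreviated, hypothetical excluded-word list (months, days of the week).
EXCLUDED = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
    "Saturday", "Sunday",
}


def tag_proper_names(sentence: str) -> List[Tuple[str, int, int]]:
    """Return (name, start, end) character offsets within one sentence."""
    spans: List[List[int]] = []
    current = None  # [start, end] of the name currently being built
    tokens = list(re.finditer(r"[A-Za-z][A-Za-z0-9&'-]*", sentence))
    for index, token in enumerate(tokens):
        word = token.group()
        # Rule: capitalized, not sentence-initial, not an excluded word.
        candidate = index > 0 and word[0].isupper() and word not in EXCLUDED
        if candidate and current and sentence[current[1]:token.start()] == " ":
            current[1] = token.end()  # assumption: merge adjacent words
        else:
            if current:
                spans.append(current)
            current = [token.start(), token.end()] if candidate else None
    if current:
        spans.append(current)
    return [(sentence[s:e], s, e) for s, e in spans]


tagged = tag_proper_names(
    "On Monday the board said Acme Corp awarded Widget Inc a contract.")
names = [name for name, _, _ in tagged]
```

Note that a rule of this form skips a proper name that opens a sentence, which is one reason the description refers to a more complex set of matching rules and lexicons.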

Processing of each selected text-based document also comprises analyzing the document on a sentence by sentence basis so as to locate a text pattern within the document (112-114). This can also be accomplished using a multi-step process performed, for example, by a pattern keyword identifier and pattern matcher. The keyword identifier can be adapted to first scan the document in order to determine whether or not a keyword from one or more of the text patterns in the input file is located in the document (112). If a keyword for a particular text pattern is found, then a full text pattern matching process can be performed, for example, by a pattern matcher, to determine if the regular text expression defined in the particular text pattern is located in the document (114). If a full text pattern is found within the document, the identity of the document is recorded and the location of the full text pattern is flagged (115).
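The two-stage screening described above, in which an inexpensive keyword test precedes the full regular expression match, might be sketched as follows. The function names, the naive sentence splitter, the document identifiers, and the sample text are hypothetical; the pattern is an assumed reading of the exemplary text pattern.

```python
import re
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    # Naive splitter (an assumption): break after ., ! or ? plus whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_pattern_matches(doc_id: str, text: str,
                         patterns: List[Tuple[str, "re.Pattern"]]):
    """Return (doc_id, sentence_index, match) for every full pattern hit."""
    hits = []
    sentences = split_sentences(text)
    for keyword, regex in patterns:
        # Stage 1: screen out the document if the keyword never occurs.
        if keyword not in text:
            continue
        # Stage 2: full pattern match, sentence by sentence.
        for index, sentence in enumerate(sentences):
            if keyword not in sentence:
                continue
            match = regex.search(sentence)
            if match:
                hits.append((doc_id, index, match))
    return hits


PATTERNS = [("awarded", re.compile(
    r"(.*)(?:has )?awarded (.*) a (?:[^ ]* ){0,3}contract"))]

TEXT = ("Shares rose on Tuesday. Acme Corp has awarded Widget Inc a five "
        "year contract. Analysts were pleased.")
hits = find_pattern_matches("doc-001", TEXT, PATTERNS)
no_hits = find_pattern_matches("doc-002", "Nothing relevant here.", PATTERNS)
```

The substring test is far cheaper than the regular expression search, so documents and sentences lacking the keyword are discarded before any pattern matching is attempted.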

Upon detection of a full text pattern within a document, a multi-step relationship detection process is performed, for example, by a relationship detector. The relationship detector refers to the list of proper names recorded by the proper noun tagger, determines if proper names are located within the first and second slots, and extracts those proper names, thereby identifying the first and second entities engaged in the relationship (119). Additionally, if an order for the relationship between the first and second entities is defined in the text pattern, then the relationship detector determines the order. Lastly, the relationship detector outputs the results of the relationship detection analysis (120). Specifically, the relationship detector can provide an output comprising the type of relationship, the names of the first and second entities engaged in the relationship, the order of the relationship (if applicable), the identification of the document, and the location in the document where the relationship was detected (i.e., the location of the text pattern), which can be stored (122) and/or displayed (124).
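The detection step can be illustrated with the following sketch, which applies the slot location identifiers (0 for far right, 1 for far left) and the relationship order identifier to the text captured in the two slots. The function names, the record layout, and the max_gap_words tolerance (an assumed reading of "within a predetermined number of words") are hypothetical, as are the sample values.

```python
from typing import Dict, List, Optional


def name_in_slot(slot_text: str, names: List[str], location: int,
                 max_gap_words: int = 1) -> Optional[str]:
    """Pick the tagged name nearest the required edge of the slot text."""
    best = None  # (gap_in_words, name)
    for name in names:
        if location == 0:  # name must sit at the far right of the slot
            pos = slot_text.rfind(name)
            gap_text = slot_text[pos + len(name):]
        else:              # name must sit at the far left of the slot
            pos = slot_text.find(name)
            gap_text = slot_text[:pos]
        if pos < 0:
            continue
        gap = len(gap_text.split())
        if gap <= max_gap_words and (best is None or gap < best[0]):
            best = (gap, name)
    return best[1] if best else None


def detect_relationship(relationship_type: str, slot1: str, slot2: str,
                        loc1: int, loc2: int, order: int, names: List[str],
                        doc_id: str, sentence_index: int) -> Optional[Dict]:
    first = name_in_slot(slot1, names, loc1)
    second = name_in_slot(slot2, names, loc2)
    if first is None or second is None:
        return None  # a proper name is required in both slots
    if order == 2:   # entity 2 stands in the relationship to entity 1
        first, second = second, first
    return {"type": relationship_type, "subject": first, "object": second,
            "symmetric": order == 0, "document": doc_id,
            "sentence": sentence_index}


result = detect_relationship(
    "customer_of", "Acme Corp has ", "Widget Inc", 0, 1, 1,
    ["Acme Corp", "Widget Inc"], "doc-001", 1)
```

With an order identifier of 1, the record reads as the name from the first slot being a customer of the name from the second slot; a value of 2 would swap the two names, and a value of 0 would mark the relationship as symmetric.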

Referring to FIG. 3, an embodiment of a system 300 for detecting relationships in one or more unstructured text documents comprises a text pattern input file 304, a keyword identifier 312, a pattern matcher 314, a proper noun tagger 315 and a relationship detector 318.

More specifically, the system 300 can comprise input files 304 stored in memory 305 (e.g., a hard drive, a disk, data storage device, etc.). The input files 304, as illustrated in FIG. 2 and discussed above, comprise a plurality of text patterns 205 that describe different types of relationships 201. These text patterns 205 can be pre-created and input in the input files 304 (e.g., by a system manufacturer) or custom developed and input into the input files 304 by the user using an input device 302 (e.g., a keyboard, disk, CD, internet link, hard drive, etc.).

Each text pattern 205 can comprise at least one regular text expression 210, discussed in detail above, having a plurality of words that describe a particular relationship 201 as well as two or more slots 208, 212 positioned within, before, or after this regular text expression 210. The slots 208, 212 will be used by the relationship detector 318, as described below, in order to identify the proper names of the entities involved in the relationship (e.g., a first slot 208 for the name of the first entity and a second slot 212 for the name of the second entity in the relationship). The text pattern 205 can also comprise slot location identifiers 202a-b that indicate a position of the first slot 208 and/or a position of the second slot 212 relative to the regular text expression 210, as described in detail above. Additionally, the text pattern 205 can comprise a relationship order identifier 204 that defines an order of the first and second entities in the relationship based on the locations of the proper names within the first and second slots 208, 212, also as described in detail above. Lastly, the text pattern 205 can comprise a keyword 206 for the particular type of relationship 201 and, specifically, for the particular text pattern 205. This keyword 206 may be used by the keyword identifier 312, as described below, to screen out documents prior to conducting a pattern matching analysis in order to improve processing speed.

A communications link 307 can be established between the system 300 and a source 306 for unstructured text documents (e.g., the Internet, the world wide web (WWW), a wide area network (WAN), a local area network, etc.) so that a user of the system 300 can select, using an input device 308 (e.g., a keyboard, a mouse, etc.), one or more text-based electronic documents 309 for analysis. The document(s) may be selected to include specific document(s), all documents in a specified category of documents, all documents having a specified date range, all documents matching a Boolean query of terms, etc. The system 300 may further comprise a pre-processor 310 adapted to pre-process each selected unstructured text document 309 prior to analysis in order to provide “noise free” text to either the proper noun tagger 315 or the keyword identifier 312, described below.

The proper noun tagger 315 can be adapted to receive each selected unstructured text document 309 and to perform a multi-step tagging process on the documents. Specifically, the tagger 315 can be adapted to first scan each document in order to identify each occurrence of a proper name within the document based on a predetermined and complex set of matching rules and lexicons. The set of matching rules can be based, for example, on at least one of word capitalization, sentence structure, sentence boundaries, and excluded words (e.g., as illustrated in the detailed discussion above). The tagger 315 can also be adapted to re-scan the document(s) 309 in order to tag each proper name and record a list of each of the proper names found within the document along with their locations 317 in memory 316.

The keyword identifier 312 is in communication with (i.e., is adapted to access) the relationship pattern input file 304 and is further adapted to receive the selected unstructured text document(s) 309 from the preprocessor 310 (e.g., before, after, or separate from the processing by the proper noun tagger) and to analyze each document 309. Specifically, the keyword identifier 312 is adapted to scan each document 309 sentence by sentence in order to determine whether or not a keyword 206 from one or more of the text patterns 205 in the input file (as illustrated in FIG. 2) is located in the document 309. If a keyword 206 for a particular text pattern 205 is found, the document containing the keyword is forwarded to a pattern matcher 314 for further analysis.

The pattern matcher 314 is adapted to perform a full text pattern matching process on the forwarded document. Specifically, the pattern matcher 314 is adapted to scan the document sentence by sentence to determine if the regular text expression defined in the particular text pattern associated with the keyword is located in the document. If a full text pattern is found within the document, the identity of the document and the location of the full text pattern 320 are recorded in a memory 319 that is accessible by the relationship detector 318. The document that contains the full text pattern is then forwarded to the relationship detector 318 for further analysis.
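
One possible way to express a text pattern and the sentence-by-sentence matching is sketched below in Python. The sketch assumes that the regular text expression is written as a Python regular expression whose two slots are the named groups slot1 and slot2; the ACQUISITION pattern, the function match_text_pattern and the layout of the recorded location are hypothetical.

```python
import re

# Hypothetical text pattern for an acquisition; the two slots are named groups.
ACQUISITION = {
    "relationship": "acquisition",
    "keyword": "acquired",
    "regex": re.compile(
        r"(?P<slot1>.+?)\s+(?:has\s+)?acquired\s+(?P<slot2>.+?)(?:[.,;]|$)"
    ),
}


def match_text_pattern(doc_id, document, pattern):
    """Scan sentence by sentence and record where the full pattern occurs."""
    locations = []
    # Naive sentence splitter (an assumption): break after '.', '!' or '?'.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    for index, sentence in enumerate(sentences):
        match = pattern["regex"].search(sentence)
        if match:
            locations.append({
                "document": doc_id,
                "sentence": index,
                # Character offsets of each slot within the sentence.
                "slot1_span": match.span("slot1"),
                "slot2_span": match.span("slot2"),
            })
    return locations


document = "The deal closed in May. Acme Corp acquired Widget Inc."
locations = match_text_pattern("doc-1", document, ACQUISITION)
```

Each recorded entry identifies the document and the sentence in which the full pattern was found, together with the character span of each slot, which is the information a later stage needs in order to look for proper names inside the slots.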

The relationship detector 318 is adapted to further analyze the document that contains the full text pattern in order to detect a relationship and, particularly, the entities engaged in the relationship. Specifically, the relationship detector 318 is adapted to access the memory 316 in order to refer to the list of proper names 317 recorded by the proper noun tagger 315. The relationship detector 318 then reviews the document and determines if proper names are located within the first and second slots for the text pattern that was located within the document. If proper names are found in both slots, the relationship detector 318 extracts those proper names, and thereby, identifies the first and second entities engaged in the relationship described by the text pattern. Additionally, if an order for the relationship between the first and second entities is defined in the text pattern, then the relationship detector 318 determines the order of each named entity. Lastly, the relationship detector outputs the results of the relationship detection analysis. Specifically, the relationship detector 318 can provide an output comprising the type of relationship (as defined by the text pattern), the names of the first and second entities engaged in the relationship, the order of the relationship (if applicable) and the identification of the document and the location in the document where the relationship was detected (i.e., the location of the text pattern). This output can be stored (e.g., in a data storage device 322) and/or displayed on a display screen 324.
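
The slot check and the resulting output record can be sketched in Python as follows. The sketch assumes that the proper names of a sentence are available as (start, end, name) offsets, that the slots are the named groups slot1 and slot2 of a Python regular expression, and that the order of the relationship is given by a pair of role labels; PATTERN, detect_relationships and the record layout are hypothetical and serve only as an example.

```python
import re
from itertools import product

# Hypothetical text pattern; "roles" gives the order of slot1 and slot2.
PATTERN = {
    "relationship": "acquisition",
    "roles": ("acquirer", "acquired"),
    "regex": re.compile(
        r"(?P<slot1>.+?)\s+acquired\s+(?P<slot2>.+?)(?:[.,;]|$)"
    ),
}


def detect_relationships(doc_id, sentence_index, sentence, proper_names, pattern):
    """Pair every proper name in slot 1 with every proper name in slot 2."""
    match = pattern["regex"].search(sentence)
    if match is None:
        return []

    def names_in(slot):
        # Keep the proper names whose span lies entirely inside the slot.
        low, high = match.span(slot)
        return [n for start, end, n in proper_names if low <= start and end <= high]

    first_role, second_role = pattern["roles"]
    # product() yields nothing if either slot holds no proper name.
    return [
        {
            "type": pattern["relationship"],
            first_role: first,
            second_role: second,
            "document": doc_id,
            "sentence": sentence_index,
        }
        for first, second in product(names_in("slot1"), names_in("slot2"))
    ]


sentence = "Acme Corp acquired Widget Inc."
proper_names = [(0, 9, "Acme Corp"), (19, 29, "Widget Inc")]
records = detect_relationships("doc-1", 1, sentence, proper_names, PATTERN)
```

Each record in this sketch carries the type of relationship, the two named entities in their defined order, and the document and sentence in which the relationship was detected; no record is produced unless both slots contain a proper name.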

Embodiments of the system 300, described above, can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment including both hardware and software elements. In a preferred embodiment, the invention is implemented using software, which includes but is not limited to firmware, resident software, microcode, etc. Furthermore, embodiments of the system 300 can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer-readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk read-only memory (CD-ROM), compact disk read/write (CD-R/W) and DVD. A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.

FIG. 4 is a schematic representation of an exemplary computer system 400 suitable for use in detecting relationships as described herein. Computer software executes under a suitable operating system installed on the computer system 400 to assist in performing the described techniques. This computer software is programmed using any suitable computer programming language, and may be thought of as comprising various software code means for achieving particular steps. The components of the computer system 400 include a computer 420, a keyboard 410 and a mouse 415, and a video display 490. The computer 420 includes a processor 440, a memory 450, input/output (I/O) interfaces 460, 465, a video interface 445, and a storage device 455. The processor 440 is a central processing unit (CPU) that executes the operating system and the computer software executing under the operating system. The memory 450 includes random access memory (RAM) and read-only memory (ROM), and is used under direction of the processor 440. The video interface 445 is connected to video display 490. User input to operate the computer 420 is provided from the keyboard 410 and mouse 415. The storage device 455 can include a disk drive or any other suitable storage medium. Each of the components of the computer 420 is connected to an internal bus 430 that includes data, address, and control buses, to allow components of the computer 420 to communicate with each other via the bus 430. The computer system 400 can be connected to one or more other similar computers via input/output (I/O) interface 465 using a communication channel 465 to a network, represented as the Internet 480. The computer software may be recorded on a portable storage medium, in which case, the computer software program is accessed by the computer system 400 from the storage device 455. Alternatively, the computer software can be accessed directly from the Internet 480 by the computer 420. In either case, a user can interact with the computer system 400 using the keyboard 410 and mouse 415 to operate the programmed computer software executing on the computer 420. Other configurations or types of computer systems can be equally well used to implement the described techniques. The computer system 400 described above is described only as an example of a particular type of system suitable for implementing the described techniques.

Therefore, disclosed above are embodiments of a system and a method for detecting relationships described in unstructured text-based electronic documents. The system and method incorporate the use of an input file that contains one or more text patterns that represent particular relationships. The text patterns each include regular text expressions that describe the particular relationship and slots for the location of each entity in that relationship. Document(s) are selected by a user and scanned by a proper noun tagger that identifies and tags every occurrence of proper names within the document(s). Then, a pattern matcher scans the document(s) to match text patterns. If a text pattern is matched within a document, a relationship detector extracts all pairs of proper names found in the slots for each matched text pattern. The output from the relationship detector includes the names for each entity in the relationship, the type of relationship, and the identity of the document and the location of the sentence describing the relationship in the document. This method and associated system are extremely cost and time efficient because they avoid the need for natural language processing or parsing (i.e., running expensive machines such as parsers and parts-of-speech taggers is unnecessary), so that they are scalable to a large number of documents. Additionally, because a user may define the text patterns with regular text expressions (as opposed to a single word or simple phrase) describing each relationship, the system and method are applicable to any type of relationship and are very precise in detecting particular relationships.

The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the invention has been described in terms of preferred embodiments, those skilled in the art will recognize that the invention can be practiced with modification within the spirit and scope of the appended claims.

1. A computer implemented method of detecting a relationship between a first entity and a second entity, said method comprising: creating a text pattern that represents a type of relationship, wherein said text pattern comprises a first slot for said first entity and a second slot for said second entity; analyzing a text-based document so as to locate said text pattern within said document; determining a location for each proper name occurring within said document; and extracting proper names located within said first slot and said second slot of said text pattern within said document, wherein said proper names located within said first slot and said second slot identify said first entity and said second entity.
2. The method of claim 1, wherein said creating of said text pattern further comprises identifying a keyword describing said relationship and wherein said method further comprises before said analyzing of said document, reviewing said document to determine if said keyword is located in said document.
3. The method of claim 1, wherein said creating of said text pattern further comprises: creating at least one text expression comprising a plurality of words that describe said type of said relationship; and setting a position of said first slot and said second slot relative to said at least one text expression.
4. The method of claim 1, wherein said determining of said location of each of said proper names comprises: scanning said document to identify each of said proper names occurring within said document based on a set of matching rules; re-scanning said document to tag said location for each of said proper names identified; and recording said location for each of said proper names.
5. The method of claim 4, wherein said set of matching rules is based on at least one of word capitalization, sentence structure, sentence boundaries, and excluded words.
6. The method of claim 1, wherein said creating of said text pattern further comprises defining an order of said first entity and said second entity in said relationship based on said locations of said proper names within said first slot and said second slot.
7. The method of claim 1, further comprising storing a record of said relationship comprising at least one of said proper name of said first entity, said proper name of said second entity, said type of relationship between said first entity and said second entity, said order of said first entity and said second entity in said relationship, and an identifier for said document and a location in said document where said relationship is detected.
8. A system for detecting a relationship between a first entity and a second entity, said system comprising: an input file adapted to store a text pattern that describes a type of relationship, wherein said text pattern comprises a first slot for said first entity and a second slot for said second entity; a pattern matcher in communication with said input file and adapted to analyze a text-based document so as to locate said text pattern within said document; a proper noun tagger adapted to locate and record occurrences of proper names within said document; and a relationship detector in communication with said pattern matcher and said proper noun tagger and adapted to extract said proper names located within said first slot and said second slot of said text pattern within said document so as to identify said first entity and said second entity and, thereby, detect said relationship.
9. The system of claim 8, further comprising a keyword identifier in communication with said input file and adapted to review said document for said keyword and to forward said document to said pattern matcher only if said keyword is located in said document.
10. The system of claim 8, wherein said text pattern further comprises: at least one text expression comprising a plurality of words that describe said relationship; and positions for said first slot and said second slot relative to said at least one text expression.
11. The system of claim 8, wherein said text pattern further comprises an order of said first entity and said second entity in said relationship based on said locations of said proper names within said first slot and said second slot.
12. The system of claim 8, wherein said proper noun tagger is further adapted to scan said document to identify each of said proper names occurring within said document based on a set of matching rules, to re-scan said document to tag said location for each of said proper names, and to record said location for each of said proper names within said document.
13. The system of claim 12, wherein said set of matching rules is based on at least one of word capitalization, sentence structure, sentence boundaries, and excluded words.
14. The system of claim 11, further comprising at least one of a data storage device adapted to store at least one of said proper name of said first entity, said proper name of said second entity, said relationship between said first entity and said second entity, said order of said first entity and said second entity in said relationship, and a record of said document in which said relationship is detected.
15. A program storage device readable by computer and tangibly embodying a program of instructions executable by said computer to perform a method of detecting a relationship between a first entity and a second entity, said method comprising: creating a text pattern that represents a type of relationship, wherein said text pattern comprises a first slot for said first entity and a second slot for said second entity; analyzing a text-based document so as to locate said text pattern within said document; determining a location for each proper name occurring within said document; and extracting proper names located within said first slot and said second slot of said text pattern within said document, wherein said proper names located within said first slot and said second slot identify said first entity and said second entity.
16. The program storage device of claim 15, wherein said creating of said text pattern further comprises identifying a keyword describing said relationship and wherein said method further comprises before said analyzing of said document, reviewing said document to determine if said keyword is located in said document.
17. The program storage device of claim 15, wherein said creating of said text pattern further comprises: creating at least one text expression comprising a plurality of words that describe said type of said relationship; and setting a position of said first slot and said second slot relative to said at least one text expression.
18. The program storage device of claim 15, wherein said determining of said location for each of said proper names comprises: scanning said document to identify each of said proper names occurring within said document based on a set of matching rules; re-scanning said document to tag said location for each of said proper names; and recording said location for each of said proper names.
19. The program storage device of claim 18, wherein said set of matching rules is based on at least one of word capitalization, sentence structure, sentence boundaries, and excluded words.
20. The program storage device of claim 15, wherein said creating of said text pattern further comprises defining an order of said first entity and said second entity in said relationship based on said locations of said proper names within said first slot and said second slot.